Lecture 4: Data Visualization¶

We will use Altair to explore the basic concepts of data visualization. Altair is an extremely powerful library that covers a wide range of needs, including fairly advanced interactive functionality. Nevertheless, we are more interested in understanding the process behind a "proper" visualization than in making a fancy visualization for its own sake. At the end of the notebook you'll find a selection of links that you can use to learn more about Altair, well beyond what is required in this course.

In [ ]:
import altair as alt
from vega_datasets import data
import pandas as pd

Initially we will work with the small iris dataset. You might have seen it before: it contains measurements for three different species of iris (the flower).

Iris
In [ ]:
iris = data.iris()
In [ ]:
iris
Out[ ]:
sepalLength sepalWidth petalLength petalWidth species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

On a general level Altair needs to know:

  1. the data you want to work with
  2. the type of graph you want to create (called mark)
  3. how you want to map (encode) the mark to the space / attributes
In [ ]:
alt.Chart(iris).mark_point().encode(
    x='sepalLength',
    y='sepalWidth',
    color='species'
)

As you can see, Altair takes care of everything else (colors, legend, axes, etc.). You can read the code as "I want to visualize the iris data with a scatterplot where X represents sepalLength, Y represents sepalWidth, and the color of the dot represents the species".

This visualization already connects the type of each variable to the way it is drawn: the two quantitative measures are mapped to positions, and the categorical species is mapped to distinct colors.

Compare it with the one below. What is the difference?

In [ ]:
alt.Chart(iris).mark_point().encode(
    x='sepalLength',
    y='species',
    color='sepalLength'
)

So there is clearly a connection between the mark you select and the data you need to visualize. At the same time, some mappings can be "purely" aesthetic. Describe the plot below: what is visualized? What is not strictly necessary?

In [ ]:
alt.Chart(iris).mark_boxplot().encode(
    x='species',
    y='petalWidth',
    color='species'
)

Altair allows you to apply transforms, such as the count() and mean() aggregations used below, so that operations are performed directly during the visualization:

In [ ]:
alt.Chart(iris).mark_bar().encode(
    x='species',
    y='count()'
)
In [ ]:
alt.Chart(iris).mark_bar().encode(
    y='species',
    x='mean(sepalLength)'
)

You can read more about the many transforms that are available here: https://altair-viz.github.io/user_guide/transform/index.html

A little more complexity¶

By default, Altair infers the data type from the pandas column type. Nevertheless, there are cases when you need or want to specify the data type yourself. You can do that by adding :LETTER after the field name:

  - Q = quantitative data
  - N = nominal data
  - O = ordinal data
  - T = temporal data
  - G = geographic shape (GeoJSON)

You can read more about this: https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types

In [ ]:
alt.Chart(iris).mark_bar().encode(
    alt.X("sepalLength:Q", bin=True),  # x is passed as alt.X (not a plain string) because we need to set an extra parameter (bin)
    y='count()'
)

Let's load our data¶

In [ ]:
df = pd.read_csv("data_cleaned.csv")
In [ ]:
df.head()
Out[ ]:
Unnamed: 0.1 Unnamed: 0 App Category Rating Reviews Size Installs Type Price($) Content Rating Genres Last Updated Current Ver Android Ver Price_rank Success Review
0 0 8884 "i DT" Fútbol. Todos Somos Técnicos. SPORTS NaN 27 3.6 500+ Free 0.0 Everyone Sports 2017-10-07 0.22 4.1 and up 8906.0 False too soon to call
1 1 8532 +Download 4 Instagram Twitter SOCIAL 4.5 40467 22.0 1,000,000+ Free 0.0 Everyone Social 2018-08-02 5.03 4.1 and up 8906.0 True good
2 2 324 - Free Comics - Comic Apps COMICS 3.5 115 9.1 10,000+ Free 0.0 Mature 17+ Comics 2018-07-13 5.0.12 5.0 and up 8906.0 False too soon to call
3 3 4541 .R TOOLS 4.5 259 203.0 10,000+ Free 0.0 Everyone Tools 2014-09-16 1.1.06 1.5 and up 8906.0 False too soon to call
4 4 4636 /u/app COMMUNICATION 4.7 573 53.0 10,000+ Free 0.0 Mature 17+ Communication 2018-07-03 4.2.4 4.1 and up 8906.0 False too soon to call

This is a large dataset. By default, Altair refuses to visualize datasets with more than 5000 rows. We need to disable this limit:

In [ ]:
alt.data_transformers.disable_max_rows()
Out[ ]:
DataTransformerRegistry.enable('default')
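
Disabling the limit means every row is passed along with the chart, which can make the notebook heavy. One possible alternative, sketched here with plain pandas, is to plot a random sample that stays under the limit. The table `big` is a made-up stand-in for the real data, and sampling is an assumption about what is acceptable for your analysis, since rare values may disappear from the plot.

```python
import pandas as pd

# Hypothetical large table standing in for the app data.
big = pd.DataFrame({"Size": range(12_000), "Reviews": range(12_000)})

MAX_ROWS = 5000  # Altair's default row limit

# Keep the chart under the limit by plotting a reproducible random sample.
if len(big) > MAX_ROWS:
    sample = big.sample(n=MAX_ROWS, random_state=0)
else:
    sample = big

print(len(big), len(sample))
```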

Barchart (binned)¶

In [ ]:
alt.Chart(df).mark_bar().encode(
    alt.X('Size:Q',bin=alt.Bin(maxbins=50)),
    y='count()',
)

Faceted boxplot¶

In [ ]:
alt.Chart(df).mark_boxplot().encode(
    x='Success',
    y='Rating',
    color='Success',
    column='Type'
)

Timelines¶

In [ ]:
alt.Chart(df).mark_line().encode(
    x='Last Updated',
    y='count()',
    color='Content Rating'
)
In [ ]:
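# MonthEnd(0) rolls each date forward to the end of its own month;
# MonthEnd(1) would push dates already on a month end into the next month.
df['Month'] = pd.to_datetime(df['Last Updated']) + pd.tseries.offsets.MonthEnd(0)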
In [ ]:
df.head()
Out[ ]:
Unnamed: 0.1 Unnamed: 0 App Category Rating Reviews Size Installs Type Price($) Content Rating Genres Last Updated Current Ver Android Ver Price_rank Success Review Month
0 0 8884 "i DT" Fútbol. Todos Somos Técnicos. SPORTS NaN 27 3.6 500+ Free 0.0 Everyone Sports 2017-10-07 0.22 4.1 and up 8906.0 False too soon to call 2017-10-31
1 1 8532 +Download 4 Instagram Twitter SOCIAL 4.5 40467 22.0 1,000,000+ Free 0.0 Everyone Social 2018-08-02 5.03 4.1 and up 8906.0 True good 2018-08-31
2 2 324 - Free Comics - Comic Apps COMICS 3.5 115 9.1 10,000+ Free 0.0 Mature 17+ Comics 2018-07-13 5.0.12 5.0 and up 8906.0 False too soon to call 2018-07-31
3 3 4541 .R TOOLS 4.5 259 203.0 10,000+ Free 0.0 Everyone Tools 2014-09-16 1.1.06 1.5 and up 8906.0 False too soon to call 2014-09-30
4 4 4636 /u/app COMMUNICATION 4.7 573 53.0 10,000+ Free 0.0 Mature 17+ Communication 2018-07-03 4.2.4 4.1 and up 8906.0 False too soon to call 2018-07-31
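
A detail worth checking when grouping dates by month end is how the offset treats dates that already fall on the last day of a month. The sketch below compares `MonthEnd(1)` and `MonthEnd(0)` on three example dates chosen for illustration.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2018-07-13", "2018-07-31", "2014-09-16"]))

# MonthEnd(1) moves to the next month end, so a date that already sits
# on a month end jumps a full month ahead.
shifted = dates + pd.tseries.offsets.MonthEnd(1)

# MonthEnd(0) rolls forward only when the date is not already a month end.
rolled = dates + pd.tseries.offsets.MonthEnd(0)

print(shifted.dt.strftime("%Y-%m-%d").tolist())
# ['2018-07-31', '2018-08-31', '2014-09-30']
print(rolled.dt.strftime("%Y-%m-%d").tolist())
# ['2018-07-31', '2018-07-31', '2014-09-30']
```

With `MonthEnd(1)`, an app updated on 2018-07-31 would be counted in August, while `MonthEnd(0)` keeps it in July.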

Linechart - multiple lines¶

In [ ]:
alt.Chart(df).mark_line().encode(
    x='Month',
    y='count()',
    color='Content Rating'
)

Heatmap (and plot size)¶

In [ ]:
heatmap=alt.Chart(df,
          title="Types & Success | categorical heatmap").mark_bar().encode(
    x='Success',
    y='Type',
    color='count()'
).properties(
    width=500,
    height=500
)
heatmap

Stacked barchart¶

In [ ]:
stacked=alt.Chart(df,
          title="Success | Stacked data").mark_bar().encode(
    x='Success',
    y='count()',
    color='Review'
).properties(
    width=500,
    height=500
)
stacked
In [ ]:
stacked|heatmap

Interactivity¶

Altair has powerful interactive capabilities. You can read more in the dedicated resources, but here we can still introduce a few basic elements and show how to add basic interactive functionality to your visualizations.

Two concepts are at the core of interactive visualization in Altair:

Parameters are the basic building blocks in the grammar of interaction. They can either be simple variables or more complex selections that map user input (e.g., mouse clicks and drags) to data queries.

Conditions and filters can respond to changes in parameter values and update chart elements based on that input.

Let's see an example by adding a parameter to select the data.

In [ ]:
brush = alt.selection_interval()

Once we have done that, we can make a specific element of the visualization conditional on the parameter:

In [ ]:
alt.Chart(df).mark_point().encode(
    x='Reviews:Q',
    y='Size:Q',
    color=alt.condition(brush,'Success:N',alt.value('lightgray'))
).add_params(brush)

There are multiple ways of selecting:

In [ ]:
pointer = alt.selection_point()

alt.Chart(df).mark_bar().encode(
    x='Success:N',
    y='mean(Size):Q',
    color=alt.condition(pointer,'Success:N',alt.value('lightgray'))
).add_params(pointer)

Selections can also be combined across charts. Here, an interval selected on the timeline filters the data shown in the bar chart:

In [ ]:
interval = alt.selection_interval(encodings=['x'])

lines=alt.Chart(df).mark_line().encode(
    x='Month',
    y='count()',
    color='Content Rating'
).add_params(interval)

bars=alt.Chart(df).mark_bar().encode(
    x='Success',
    y='count()',
    color='Content Rating'
).transform_filter(interval)
lines | bars

More resources:¶

Getting started

Detailed user guide

Gallery with code

Tutorial - Video + Code

In [ ]:
from importlib.metadata import version
In [ ]:
version('altair')
Out[ ]:
'5.0.1'